Optimizing knowledge distillation models for language models
Abstract
The problem of optimizing large neural networks is considered using language models as an example. The size of large language models impedes their practical application when computing resources and memory are limited. One actively developed direction for compressing large neural network models is knowledge distillation: the transfer of knowledge from a large teacher model to a smaller student model without a significant loss of accuracy. Known knowledge distillation methods have certain drawbacks: inaccurate knowledge transfer, a lengthy training process, and error accumulation on long sequences. Two methods that improve the quality of knowledge distillation for language models are proposed: selective teacher intervention in the student’s training process and low-rank adaptation. The first approach injects teacher tokens during student training at the positions where the divergence between the teacher’s and the student’s probability distributions exceeds an exponentially decreasing threshold. The second approach reduces the number of trainable parameters by replacing fully connected layers with low-rank ones, which lowers the risk of overfitting and speeds up training. The limitations of each method on long sequences are shown, and a combination of the two methods is proposed, yielding an improved classical knowledge distillation model for long sequences. The combined approach to knowledge distillation on long sequences makes it possible to compress the resulting model substantially with only a slight loss of quality, while significantly reducing GPU memory consumption and inference time. The complementary approaches to optimizing knowledge transfer and model compression showed better results than selective teacher intervention and low-rank adaptation applied separately. The answer quality of the improved classical knowledge distillation model on long sequences reached 97 % of the quality of full fine-tuning and 98 % of the quality of low-rank adaptation in terms of ROUGE-L and perplexity, while the number of trainable parameters is reduced by 99 % compared to full fine-tuning and by 49 % compared to low-rank adaptation. In addition, GPU memory usage is reduced by 75 % and 30 %, respectively, and inference time by 30 %. The proposed combination of knowledge distillation methods can find application in tasks with limited computational resources.
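To make the first approach concrete, below is a minimal sketch (not the authors’ code) of selective teacher intervention at a single decoding step: the KL divergence between the teacher’s and the student’s next-token distributions is compared with an exponentially decaying threshold, and the teacher’s token is injected when the threshold is exceeded. Hugging Face-style causal LMs whose forward pass returns `.logits`, and the names `tau0` and `gamma`, are illustrative assumptions.

```python
# Sketch of selective teacher intervention (assumed formulation, not the
# authors' implementation). The teacher's token replaces the student's
# proposal whenever KL(teacher || student) exceeds a threshold that decays
# exponentially with the training step.
import torch
import torch.nn.functional as F

def selective_intervention_step(teacher, student, input_ids, step,
                                tau0=1.0, gamma=0.99):
    """Return the next token, taken from the teacher when the student's
    next-token distribution diverges beyond the decayed threshold."""
    with torch.no_grad():
        t_logits = teacher(input_ids).logits[:, -1, :]  # teacher is frozen
    s_logits = student(input_ids).logits[:, -1, :]

    # KL(teacher || student) over the next-token distribution
    kl = F.kl_div(F.log_softmax(s_logits, dim=-1),
                  F.softmax(t_logits, dim=-1),
                  reduction="batchmean")

    tau = tau0 * gamma ** step           # exponentially decreasing threshold
    if kl.item() > tau:                  # student off track: teacher intervenes
        next_token = t_logits.argmax(dim=-1, keepdim=True)
    else:                                # student close enough: keep its token
        next_token = s_logits.argmax(dim=-1, keepdim=True)
    return next_token, kl.item()
```

As the threshold decays, the teacher intervenes less and less, so the student is forced to rely on its own distribution later in training.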
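For the second approach, the abstract describes replacing fully connected layers with low-rank ones; the common low-rank adaptation (LoRA) formulation, sketched here under that assumption, freezes the dense weight and trains only a low-rank update, so each layer trains r·(d_in + d_out) parameters instead of d_in·d_out. The class name and hyperparameters (`r`, `alpha`) are illustrative.

```python
# Minimal LoRA-style wrapper for a fully connected layer: the frozen base
# weight W is augmented by a trainable low-rank update (alpha/r) * B @ A.
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():          # freeze the dense layer
            p.requires_grad_(False)
        d_out, d_in = base.weight.shape
        self.A = nn.Parameter(torch.randn(r, d_in) * 0.01)  # low-rank factor
        self.B = nn.Parameter(torch.zeros(d_out, r))        # zero init: no
        self.scale = alpha / r                              # change at start

    def forward(self, x):
        # base output plus the scaled low-rank correction
        return self.base(x) + self.scale * (x @ self.A.T) @ self.B.T
```

Initializing `B` to zero keeps the wrapped layer identical to the original at the start of training, so only the low-rank correction is learned.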